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ABSTRACT 

This  paper  presents  a  novel  multimodal  system  applied  to 
the  setup  and  control  of  distributed  interactive  simulations. 
We  have  developed  the  QuickSet  prototype,  a  pen/voice 
system  running  on  a  hand-held  PC,  communicating  through 
a  distributed  agent  architecture  to  NRaD’s1  LeatherNet 
system,  a  distributed  interactive  training  simulator  built  for 
the  US  Marine  Corps  (USMC).  The  paper  briefly  describes 
the  system  and  illustrates  its  use  in  multimodal  simulation 
setup. 
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gesture  recognition,  speech  recognition,  natural  language 
processing,  distributed  interactive  simulation. 

1.  INTRODUCTION 

In  order  to  train  personnel  more  effectively,  the  US  military 
is  developing  large-scale  distributed  simulation  capabilities. 
Begun  as  SIMNET  in  the  1980’s  [23],  these  distributed, 
interactive  environments  attempt  to  provide  a  high  degree  of 
fidelity  in  simulating  combat,  including  simulations  of  the 
individual  combatants,  the  equipment,  entity  movements, 
atmospheric  effects,  etc.  There  are  four  general  phases  of 
user  interaction  with  these  simulations:  Creating  entities, 
supplying  their  initial  behavior,  interacting  with  the  entities 
during  a  running  simulation,  and  reviewing  the  results.  The 
present  research  concentrates  on  the  first  two  of  these 
stages. 

Our  contribution  to  the  distributed  interactive  simulation 
(DIS)  effort  is  to  rethink  the  nature  of  the  user  interaction. 
As  with  most  modem  simulators,  DISs  are  controlled  via 
graphical  user  interfaces  (GUIs).  However,  the  simulation 
GUI  is  showing  signs  of  strain,  since  even  for  a  small-scale 
scenario,  it  requires  users  to  choose  from  hundreds  of 
entities  in  order  to  select  the  desired  ones  to  place  on  a  map. 
To  compound  these  interface  problems,  the  military  is 
intending  to  increase  the  scale  of  the  simulations 


1  NRaD  =  US  Navy  Command  and  Control  Ocean  Systems  Center 
Research  Development  Test  and  Evaluation  (San  Diego). 


dramatically,  while  at  the  same  time,  for  reasons  of 
mobility  and  affordability,  desiring  that  simulations  should 
be  creatable  from  small  devices  (e.g.,  PDAs).  This 
impending  collision  of  trends  for  smaller  screen  size  and  for 
more  entities  requires  a  different  paradigm  for  human- 
computer  interaction. 

We  have  argued  generically  that  GUI  technologies  offer 
advantages  in  allowing  users  to  manipulate  objects  that  are 
on  the  screen,  in  reminding  users  of  their  options,  and  in 
minimizing  errors  [7],  However,  GUIs  are  often  weak  in 
supporting  interactions  with  many  objects,  or  objects  not 
on  the  screen.  In  contrast,  it  was  argued  that  linguistically- 
based  interface  technologies  offer  the  potential  to  describe 
large  sets  of  objects,  which  may  not  all  be  present  on  a 
screen,  and  can  be  used  to  create  more  complex  behaviors 
through  specification  of  rule  invocation  conditions. 
Simulation  is  one  type  of  application  for  which  these 
limitations  of  GUIs,  as  well  as  the  strengths  of  natural 
language,  especially  spoken  language,  are  apparent  [6]. 

It  has  become  clear,  however,  that  speech-only  interaction  is 
not  optimal  for  spatial  tasks.  Using  a  high-fidelity 
"Wizard-of-Oz"  methodology  [20],  recent  empirical  results 
demonstrate  clear  language  processing  and  task  performance 
advantages  for  multimodal  (pen/voice)  input  over  speech- 
only  input  for  map-based  systems  [17,18], 

3.  QUICKSET 

To  address  these  simulation  interface  problems,  and 
motivated  by  the  above  results,  we  have  developed 
QuickSet  (see  Figure  1)  a  collaborative,  handheld, 
multimodal  system  for  configuring  military  simulations 
based  on  LeatherNet  [5],  a  system  used  in  training  platoon 
leaders  and  company  commanders  at  the  USMC  base  at  29 
Palms,  California.  LeatherNet  simulations  are  created  using 
the  ModSAF  simulator  [10]  and  can  be  visualized  in  a 
CAVE-based  virtual  reality  environment  [11,  26]  called 
CommandVu  (see  Figure  2  —  QuickSet  systems  are  on 
the  soldiers’  tables).  In  addition  to  LeatherNet,  QuickSet  is 
being  used  in  a  second  effort  called  Exlnit  (Exercise 
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Initialization),  that  will  enable  users  to  create  division-sized 
exercises.  Because  of  the  use  of  OAA,  QuickSet  can 
interoperate  with  agents  from  CommandTalk  [14],  which 
provides  a  speech-only  interface  to  ModSAF. 

QuickSet  runs  on  both  desktop  and  hand-held  PC’s, 
communicating  over  wired  and  wireless  LAN’s,  or  modem 
links.  The  system  combines  speech  and  pen-based  gesture 
input  on  multiple  3-lb  hand-held  PCs  (Fujitsu  Stylistic 
1000),  which  communicate  via  wireless  LAN  through  the 
Open  Agent  Architecture  (OAA)2  [8],  to  ModSAF,  and  also 
to  CommandVu.  With  this  highly  portable  device,  a  user 
can  create  entities,  establish  "control  measures"  (e.g., 
objectives,  checkpoints,  etc.),  draw  and  label  various  lines 
and  areas,  (e.g.,  landing  zones)  and  give  the  entities 
behavior. 


Figure  1 :  QuickSet  running  on  a  wireless  handheld  PC. 

In  the  remainder  of  the  paper,  we  illustrate  the  system 
briefly,  describe  its  components,  and  discuss  its  application. 


Figure  2:  Artist’s  rendition  of  QuickSet  used  with 
CommandVu  virtual  display  of  distributed 
interactive  simulation. 


4.  SYSTEM  ARCHITECTURE 

Architecturally,  QuickSet  uses  distributed  agent 
technologies  based  on  the  Open  Agent  Architecture  for 
interoperation,  information  brokering  and  distribution.  An 

2  Open  Agent  Architecture  is  a  trademark  of  SRI  International. 


agent-based  architecture  was  chosen  to  support  this 
application  because  it  offers  easy  connection  to  legacy 
applications,  and  the  ability  to  run  the  same  set  of  software 
components  in  a  variety  of  hardware  configurations,  ranging 
from  stand-alone  on  the  handheld  PC,  to  distributed 
operation  across  numerous  workstations  and  PCs. 
Additionally,  the  architecture  supports  mobility  in  that 
lighter  weight  agents  can  run  on  the  handheld,  while  more 
computationally-intensive  processing  can  be  migrated 
elsewhere  on  the  network.  The  agents  may  be  written  in 
any  programming  language  (here,  Quintus  Prolog,  Visual 
C++,  Visual  Basic,  and  Java),  as  long  as  they  communicate 
via  an  interagent  communication  language.  The 
configuration  of  agents  used  in  the  Quickset  system  is 
illustrated  in  Figure  3.  A  brief  description  of  each  agent 
follows: 

QuickSet  interface:  On  the  handheld  PC  is  a  geo- 
referenced  map  of  the  region  such  that  entities  displayed  on 
the  map  are  registered  to  their  positions  on  the  actual 
terrain,  and  thereby  to  their  positions  on  each  of  the  various 
user  interfaces  connected  to  the  simulation.  The  map 
interface  agent  provides  the  usual  pan  and  zoom  capabilities, 
multiple  overlays,  icons,  etc.  The  user  can  draw  directly  on 
the  map,  in  order  to  create  points,  lines,  and  areas.  The  user 
can  create  entities,  give  them  behavior,  and  watch  the 
simulation  unfold  from  the  handheld.  When  the  pen  is 
placed  on  the  screen,  the  speech  recognizer  is  activated, 
thereby  allowing  users  to  speak  and  gesture  simultaneously. 

Speech  recognition  agent:  The  speech  recognition 
agent  used  in  QuickSet  employs  either  IBM’s  VoiceType 
Application  Factory  or  VoiceType  3.0  recognizers.  The 
recognizers  use  an  HMM-based  continuous  speaker- 
independent  speech  recognition  technology  for  PC’s  under 
Windows  95/NT.  Currently,  the  system  has  a  vocabulary  of 
450  words.  It  produces  a  single  most  likely  interpretation 
of  an  utterance. 

Gesture  recognition  agent:  OGI’s  gesture  recognition 
agent  processes  all  pen  input  from  a  PC  screen  or  tablet. 
The  agent  weights  the  results  of  both  HMM  and  neural  net 
recognizers,  producing  a  combined  score  for  each  of  the 
possible  recognition  results.  Currently,  45  gestures  can  be 
recognized,  resulting  in  the  creation  of  21  military  symbols, 
irregular  shapes,  and  various  types  of  lines. 

Natural  language  agent:  The  natural  language  agent 
currently  employs  a  definite  clause  grammar  and  produces 
typed  feature  structures  as  a  representation  of  the  utterance 
meaning.  Currently,  for  this  task,  the  language  consists  of 
noun  phrases  that  label  entities,  as  well  as  a  variety  of 
imperative  constructs  for  supplying  behavior. 

Multimodal  integration  agent:  The  multimodal 

interpretation  agent  accepts  typed  feature  structure  meaning 
representations  from  the  language  and  gesture  recognition 
agents,  and  produces  a  unified  multimodal  interpretation. 
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QuickSet  Brokered  Architecture 


Figure  3:  A  blackboard  is  used  by  a  facilitator 
agent,  who  routes  queries  to  appropriate  agents  for 
solution. 


Simulation  agent:  The  simulation  agent,  developed 
primarily  by  SRI  International,  but  modified  by  us  for 
multimodal  interaction,  serves  as  the  communication 
channel  between  the  OAA-brokered  agents  and  the  ModSAF 
simulation  system.  This  agent  offers  an  API  for  ModSAF 
that  other  agents  can  use. 

Web  display  agent:  The  Web  display  agent  can  be  used 
to  create  entities,  points,  lines,  and  areas.  It  posts  queries 
for  updates  to  the  state  of  the  simulation  via  Java  code  that 
interacts  with  the  blackboard  and  facilitator.  The  queries  are 
routed  to  the  running  ModSAF  simulation,  and  the 
available  entities  can  be  viewed  over  a  WWW  connection 
using  a  suitable  browser. 

Other  user  interfaces:  When  another  user  interface 
connected  to  the  facilitator  subscribes  to  and  produces  the 
same  set  of  events  as  others,  it  immediately  becomes  part  of 
a  collaboration.  One  can  view  this  as  human-human 
collaboration  mediated  by  the  agent  architecture,  or  as  agent- 
agent  collaboration. 

CommandVu  agent:  Since  the  CommandVu  virtual 
reality  system  is  an  agent,  the  same  multimodal  interface  on 
the  handheld  PC  can  be  used  to  create  entities  and  to  fly  the 
user  through  the  3-D  terrain.  For  example,  the  user  can  ask 
“CommandVu,  fly  me  to  this  platoon  <gesture  on  the 
map>.” 

Application  bridge  agent:  The  bridge  agent 

generalizes  the  underlying  applications’  API  to  typed  feature 
structures,  thereby  providing  an  interface  to  the  various 
applications  such  as  ModSAF,  CommandVu,  and  Exinit. 
This  allows  for  a  domain-independent  integration 
architecture  in  which  constraints  on  multimodal 
interpretation  are  stated  in  terms  of  higher-level  constructs 
such  as  typed  feature  structures,  greatly  facilitating  reuse. 

CORBA  bridge  agent:  This  agent  converts  OAA 
messages  to  CORBA  IDL  (Interface  Definition  Language) 
for  the  Exercise  Initialization  project. 


More  detail  on  the  architecture  and  the  individual  agents  are 
provided  in  [12,  22]. 

5.  EXAMPLE 

Holding  QuickSet  in  hand,  the  user  views  a  map  from  the 
ModSAF  simulation,  and  with  spoken  language  coupled 
with  pen  gestures,  issues  commands  to  ModSAF.  In  order 
to  create  a  unit  in  QuickSet,  the  user  would  hold  the  pen  at 
the  desired  location  and  utter  (for  instance):  “red  T72 
platoon”  resulting  in  a  new  platoon  of  the  specified  type 
being  created  at  that  location. 


Figure  4:  The  QuickSet  interface  as  the  user 
establishes  two  platoons,  a  barbed-wire  fence,  a 
breached  minefield,  and  then  issues  a  command  to  one 
platoon  to  follow  a  traced  route. 

The  user  then  adds  a  barbed-wire  fence  to  the  simulation  by 
drawing  a  line  at  the  desired  location  while  uttering  “barbed 
wire.”  Similarly  a  fortified  line  is  added.  A  minefield  of  an 
amorphous  shape  is  drawn  and  is  labeled  verbally,  and 
finally  an  M1A1  platoon  is  created  as  above.  Then  the  user 
can  assign  a  task  to  the  new  platoon  by  saying  “M1A1 
platoon  follow  this  route”  while  drawing  the  route  with  the 
pen.  The  results  of  these  commands  are  visible  on  the 
QuickSet  screen,  as  seen  in  Figure  4,  in  the  ModSAF 
simulation,  and  in  the  CommandVu  3D  rendering  of  the 
scene.  In  addition  to  multimodal  input,  unimodal  spoken 
language  and  gestural  commands  can  be  given  at  any  time, 
depending  on  the  user’s  task  and  preference. 

6.  MULTIMODAL  INTEGRATION 

Since  any  unimodal  recognizer  will  make  mistakes,  the 
output  of  the  gesture  recognizer  is  not  accepted  as  a  simple 
unilateral  decision.  Instead  the  recognizer  produces  a  set  of 
probabilities,  one  for  each  possible  interpretation  of  the 
gesture.  The  recognized  entities,  as  well  as  their 
recognition  probabilities,  are  sent  to  the  facilitator,  which 
forwards  them  to  the  multimodal  interpretation  agent.  In 
combining  the  meanings  of  the  gestural  and  spoken 
interpretations,  we  attempt  to  satisfy  an  important  design 
consideration,  namely  that  the  communicative  modalities 
should  compensate  for  each  other’s  weaknesses  [7,  16]. 
This  is  accomplished  by  selecting  the  highest  scoring 
unified  interpretation  of  speech  and  gesture.  Importantly, 
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the  unified  interpretation  might  not  include  the  highest 
scoring  gestural  (or  spoken  language)  interpretation  because 
it  might  not  be  semantically  compatible  with  the  other 
mode.  The  key  to  this  interpretation  process  is  the  use  of  a 
typed  feature  structure  [1,  3]  as  a  meaning  representation 
language  that  is  common  to  the  natural  language  and 
gestural  interpretation  agents.  Johnston  et  al.  [12]  present 
the  details  of  multimodal  integration  of  continuous  speech 
and  pen-based  gesture,  guided  by  research  in  users’ 
multimodal  integration  and  synchronization  strategies  [19]. 
Unlike  many  previous  approaches  to  multimodal  integration 
(e.g,  [2,  9,  12,  15,  25])  speech  is  not  “in  charge,”  in  the 
sense  of  relegating  gesture  a  secondary  and  dependent  role. 
This  mutually-compensatory  interpretation  process  is 
capable  of  analyzing  multimodal  constructions,  as  well  as 
speech-only  and  pen-only  constructions  when  they  occur. 
Vo  and  Wood’s  system  [24]  is  similar  to  the  one  reported 
here,  though  we  believe  the  use  of  typed  feature  structures 
provides  a  more  generally  usable  and  formal  integration 
mechanism  than  their  frame-merging  strategy.  Cheyer  and 
Julia  [4]  sketch  a  system  based  on  Oviatt’s  [17]  results  and 
the  OAA  [8],  but  do  not  discuss  the  integration  strategy  nor 
multimodal  compensation. 

7.  CONCLUDING  REMARKS 

QuickSet  has  been  delivered  to  the  US  Navy  (NRaD)  and 
US  Marine  Corps,  for  use  at  29  Palms,  California,  where  it 
is  primarily  used  to  set  up  training  scenarios  and  to  control 
the  virtual  environment.  It  is  also  installed  at  NRaD’s 
Command  Center  of  the  Future.  The  system  was  used  by 
the  US  Army’s  82nd  Airborne  Corps,  at  Ft.  Bragg  during 
the  Royal  Dragon  Exercise.  There,  QuickSet  was  deployed 
in  a  tent,  where  it  was  subjected  to  an  extreme  noise 
environment,  including  explosions,  low-flying  jet  aircraft, 
generators,  and  the  like.  Not  surprisingly,  spoken 
interaction  with  QuickSet  was  not  feasible,  although  users 
gestured  successfully.  Instead,  users  wanted  to  gesture. 
Although  we  had  provided  a  multimodal  interface  for  use  in 
less  hostile  conditions,  nevertheless  we  needed  to 
provide,and  in  fact  have  provided,  a  complete  overlap  in 
functionality,  such  that  any  task  can  be  accomplished  just 
with  pen  or  just  with  speech  when  necessary.  Finally, 
QuickSet  is  now  being  extended  for  use  in  the  Exlnit 
simulation  initialization  system  for  DARPA’s  STOW-97 
Advanced  Concept  Demonstration  that  is  intended  for 
creation  of  division-sized  exercises. 

Regarding  the  multimodal  interface  itself,  QuickSet  has 
undergone  a  “proactive”  interface  evaluation  in  that  the 
studies  that  were  performed  in  advance  of  building  the 
system  predicted  the  utility  of  multimodal  over  unimodal 
speech  as  an  input  to  map-based  systems  [17,  18].  In 
particular,  it  was  discovered  in  this  research  that  multimodal 
interaction  generates  simpler  language  than  unimodal 
spoken  commands  to  maps.  For  example,  to  create  a 
“phase  line”  between  two  three-digit  <x,y>  grid  coordinates, 
a  user  would  have  to  say:  “create  a  line  from  nine  four  three 
nine  six  one  to  nine  five  seven  nine  six  eight  and  call  it 
phase  line  green”  [14],  In  contrast,  a  QuickSet  user  would 
say  “phase  line  green”  while  drawing  a  line.  Creation  of  area 


features  with  unimodal  speech  would  be  more  complex  still, 
if  not  infeasible.  Given  that  numerous  difficult-to-process 
linguistic  phenomena  (such  as  utterance  disfluencies)  ate 
known  to  be  elevated  in  lengthy  utterances,  and  also  to  be 
elevated  when  people  speak  locative  constituents  [17,  18], 
multimodal  interaction  that  permits  pen  input  to  specify 
locations  and  that  results  in  brevity  offers  the  possibility  of 
more  robust  recognition. 

Further  development  of  QuickSet’s  spoken,  gestural,  and 
multimodal  integration  capabilites  are  continuing.  Research 
is  also  ongoing  to  examine  and  quantify  the  benefits  of 
multimodal  interaction  in  general,  and  our  architecture  in 
particular. 
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